English Speech


From A for algebra to T for tariffs: Arabic words used in English speech

Al Jazeera

Arabic is one of the world's most widely spoken languages, with at least 400 million speakers: roughly 200 million native speakers and 200 million to 250 million non-native speakers. Modern Standard Arabic (MSA) serves as the formal language of government, legal matters and education, and it is widely used in international and religious contexts. In addition, more than 25 dialects are spoken, primarily across the Middle East and North Africa. World Arabic Language Day is observed on December 18, the date in 1973 on which the UN General Assembly adopted Arabic as one of its six official languages. In the following visual explainer, Al Jazeera lists some of the most common words in today's English that originated from Arabic or passed through Arabic before reaching English.


Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data

Rashidi, Sina, Sameti, Hossein

arXiv.org Artificial Intelligence

Direct speech-to-speech translation (S2ST), in which all components are trained jointly, is an attractive alternative to cascaded systems because it offers a simpler pipeline and lower inference latency. However, direct S2ST models require large amounts of parallel speech data in the source and target languages, which are rarely available for low-resource languages such as Persian. This paper presents a direct S2ST system for translating Persian speech into English speech, as well as a pipeline for synthetic parallel Persian-English speech generation. The model comprises three components: (1) a conformer-based encoder, initialized from self-supervised pre-training, maps source speech to high-level acoustic representations; (2) a causal transformer decoder with relative position multi-head attention translates these representations into discrete target speech units; (3) a unit-based neural vocoder generates waveforms from the predicted discrete units. To mitigate the data scarcity problem, we construct a new Persian-English parallel speech corpus by translating Persian speech transcriptions into English using a large language model and then synthesizing the corresponding English speech with a state-of-the-art zero-shot text-to-speech system. The resulting corpus increases the amount of available parallel speech by roughly a factor of six. On the Persian-English portion of the CVSS corpus, the proposed model achieves a 4.6 ASR BLEU improvement over direct baselines when trained with the synthetic data. These results indicate that combining self-supervised pre-training, discrete speech units, and synthetic parallel data is effective for improving direct S2ST in low-resource language pairs such as Persian-English.
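To make the three-stage design concrete, here is a minimal PyTorch sketch of the encoder-to-decoder-to-units flow. The layer sizes, the unit vocabulary of 1000, and the use of plain (rather than relative-position) attention are simplifying assumptions, and the vocoder stage is only noted in a comment; this is not the authors' released code.

```python
# Sketch of a direct S2ST pipeline: speech encoder -> causal unit decoder.
# Sizes and modules are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Stand-in for the conformer encoder (initialized from
    self-supervised pre-training in the paper)."""
    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, feats):               # feats: (B, T, n_mels)
        return self.encoder(self.proj(feats))

class UnitDecoder(nn.Module):
    """Causal transformer decoder that emits discrete target speech units."""
    def __init__(self, n_units=1000, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(n_units, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.out = nn.Linear(d_model, n_units)

    def forward(self, units, memory):        # units: (B, U) previous units
        x = self.embed(units)
        # Causal mask so each position only attends to earlier units.
        L = x.size(1)
        mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        return self.out(self.decoder(x, memory, tgt_mask=mask))

# Forward pass: source features -> encoder -> decoder -> unit logits.
# A unit-based neural vocoder (e.g. a unit HiFi-GAN) would then turn the
# predicted unit sequence into a waveform.
enc, dec = SpeechEncoder(), UnitDecoder()
feats = torch.randn(2, 120, 80)               # batch of source spectrograms
prev_units = torch.randint(0, 1000, (2, 50))  # teacher-forced unit history
logits = dec(prev_units, enc(feats))          # (2, 50, 1000)
```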


Incorporating L2 Phonemes Using Articulatory Features for Robust Speech Recognition

Wang, Jisung, Lee, Haram, Oh, Myungwoo

arXiv.org Artificial Intelligence

The limited availability of non-native speech datasets presents a major challenge for automatic speech recognition (ASR) in narrowing the performance gap between native and non-native speakers. To address this, this study focuses on efficiently incorporating L2 phonemes, which here refer to Korean phonemes, through articulatory feature analysis. This not only enables accurate modeling of pronunciation variants but also allows both native Korean and English speech datasets to be used. We employ the lattice-free maximum mutual information (LF-MMI) objective in an end-to-end manner to train the acoustic model to align with and predict one of multiple pronunciation candidates. Experimental results show that the proposed method improves ASR accuracy for Korean L2 speech while training solely on L1 speech data. Furthermore, fine-tuning on L2 speech improves recognition accuracy for both L1 and L2 speech without performance trade-offs.
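One way to picture the articulatory-feature step is to represent each phoneme as a small feature vector and rank candidate phonemes by feature distance, so an L2 phoneme maps to several plausible L1 pronunciations. The tiny feature inventory and Hamming-distance ranking below are made-up illustrations, not the paper's actual feature set or candidate-generation procedure.

```python
# Illustrative sketch: rank pronunciation candidates for an L2 phoneme by
# distance in articulatory-feature space. Features here are a toy example.
ARTIC_FEATURES = {
    # phoneme: (voiced, bilabial, alveolar, velar, plosive, nasal)
    "p": (0, 1, 0, 0, 1, 0),
    "b": (1, 1, 0, 0, 1, 0),
    "t": (0, 0, 1, 0, 1, 0),
    "d": (1, 0, 1, 0, 1, 0),
    "k": (0, 0, 0, 1, 1, 0),
    "g": (1, 0, 0, 1, 1, 0),
    "m": (1, 1, 0, 0, 0, 1),
    "n": (1, 0, 1, 0, 0, 1),
}

def candidates(l2_phoneme, l1_inventory, k=2):
    """Return the k L1 phonemes closest to the L2 phoneme by Hamming
    distance over articulatory features."""
    target = ARTIC_FEATURES[l2_phoneme]
    dist = lambda p: sum(a != b for a, b in zip(ARTIC_FEATURES[p], target))
    return sorted(l1_inventory, key=dist)[:k]

# The k nearest phonemes become alternative pronunciations; LF-MMI training
# then lets the acoustic model align against any of the candidates.
print(candidates("b", ["p", "t", "d", "k", "g", "m"]))  # -> ['p', 'd']
```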


M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval

Berry, Layne, Shih, Yi-Jen, Wang, Hsuan-Fu, Chang, Heng-Jui, Lee, Hung-yi, Harwath, David

arXiv.org Artificial Intelligence

This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval. For non-English image-speech retrieval, we outperform the current state of the art by a wide margin, both when training separate models for each language and with a single model that processes speech in all three languages. We identify key differences in model behavior and performance between English and non-English settings, attributable to the English-only pre-training of CLIP and HuBERT, and investigate how fine-tuning the pre-trained models affects these differences. Finally, we show that our models can be used for mono- and cross-lingual speech-text retrieval and cross-lingual speech-speech retrieval, despite never having seen parallel speech-text or speech-speech data during training.
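The retrieval step itself reduces to scoring spoken captions against images by cosine similarity in a shared embedding space. The sketch below shows that scoring with random tensors standing in for HuBERT speech features and CLIP image features; the projection heads and mean-pooling are assumptions for illustration, not the paper's exact architecture.

```python
# Minimal sketch of speech-to-image retrieval via cosine similarity.
# Random tensors stand in for HuBERT (speech) and CLIP (image) features.
import torch
import torch.nn.functional as F

d_speech, d_image, d_joint = 768, 512, 256
speech_proj = torch.nn.Linear(d_speech, d_joint)  # learned projection heads
image_proj = torch.nn.Linear(d_image, d_joint)

speech_feats = torch.randn(4, 250, d_speech)      # (captions, frames, dim)
image_feats = torch.randn(10, d_image)            # (images, dim)

# Pool speech frames, project both sides into the joint space, normalize.
s = F.normalize(speech_proj(speech_feats.mean(dim=1)), dim=-1)
v = F.normalize(image_proj(image_feats), dim=-1)

scores = s @ v.T                                  # (4 captions, 10 images)
best = scores.argmax(dim=1)                       # retrieved image per caption
print(best)
```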


A model that can recognize speech in different languages from a speaker's lip movements

#artificialintelligence

In recent years, deep learning techniques have achieved remarkable results in numerous language and image-processing tasks. This includes visual speech recognition (VSR), which entails identifying the content of speech solely by analyzing a speaker's lip movements. While some deep learning algorithms have achieved highly promising results on VSR tasks, they were primarily trained to detect speech in English, as most existing training datasets only include English speech. This limits their potential user base to people who live or work in English-speaking contexts. Researchers at Imperial College London have recently developed a new model that can tackle VSR tasks in multiple languages.

